Unsupervised Separation of Transliterable and Native Words for Malayalam

نویسنده

  • Deepak P
چکیده

Differentiating intrinsic language words from transliterable words is a key step aiding text processing tasks involving different natural languages. We consider the problem of unsupervised separation of transliterable words from native words for text in Malayalam language. Outlining a key observation on the diversity of characters beyond the word stem, we develop an optimization method to score words based on their nativeness. Our method relies on the usage of probability distributions over character n-grams that are refined in step with the nativeness scorings in an iterative optimization formulation. Using an empirical evaluation, we illustrate that our method, DTIM, provides significant improvements in nativeness scoring for Malayalam, establishing DTIM as the preferred method for the task.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving the accuracy of pronunciation lexicon using Naive Bayes classifier with character n-gram as feature: for language classified pronunciation lexicon generation

This paper looks at improving the accuracy of pronunciation lexicon for Malayalam by improving the quality of front end processing. Pronunciation lexicon is an in evitable component in speech research and speech applications like TTS and ASR. This paper details the work done to improve the accuracy of automatic pronunciation lexicon generator (APLG) with Naive Bayes classifier using character n...

متن کامل

Influence of Native and Non-Native Multitalker Babble on Speech Recognition in Noise

The aim of the study was to assess speech recognition in noise using multitalker babble of native and non-native language at two different signal to noise ratios. The speech recognition in noise was assessed on 60 participants (18 to 30 years) with normal hearing sensitivity, having Malayalam and Kannada as their native language. For this purpose, 6 and 10 multitalker babble were generated in K...

متن کامل

CUSAT_NLP@DPIL-FIRE2016: Malayalam Paraphrase Detection

This paper describes an approach for paraphrase detection in Malayalam sentences developed as part of FIRE 2016 Shared Task on Paraphrase detection in Indian Languages. The task of paraphrasedetection is finding a sentence with the same meaning of another sentence expressed using same or different words. This detection is done by a semantic approach which is language dependent. Individual words...

متن کامل

A Hybrid Statistical Approach for Named Entity Recognition for Malayalam Language

Named-Entity Recognition (NER) plays a significant role in classifying or locating atomic elements in text into predefined categories such as the name of persons, organizations, locations, expression of times, quantities, monetary values, temporal expressions and percentages. Several Statistical methods with supervised and unsupervised learning have applied English and some other Indian languag...

متن کامل

Comparison school bonding and interpersonal problems in students with unsupervised and abused families with normal

This study aimed to compare the school bonding and interpersonal problems in students with unsupervised and abused families with normal families in Bandar Lengeh. The sample consisted of 152 normal students and 81 unsupervised or abused students. Normal students were selected by the multi-stage cluster sampling method. Data were collected through two questionnaires: school bonding (Rezaei Shari...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2018